More Snakemake: SLURM, Configs, Input Functions, Benchmarks

Eric C. Anderson

Computational Methods for Molecular Biology, SWFSC/CSU

What are we doing here?

This is the hands-on section that goes with the narrative chapter, Important Snakemake Embellishments.

Topics:

  • Resources for Rules
  • Profiles for submitting to SLURM
  • YAML Configuration and python dicts
  • Tabular configuration and pandas
  • Input functions

We show all of these in an embellished example Snakefile called Snakefile2 in the Snakemake-Example directory. This Snakefile will process all 16 Chinook samples across 4 chromosomes from our example data set.

Our Chinook Example: GATK Best Practices “Light”

flowchart TD
  A(fastq files from 16 samples: our raw data) --> B(Trim the reads: fastp)
  B --> C(Map the reads to a reference genome: bwa-mem2)
  C --> D(Mark PCR and optical duplicates: gatk MarkDuplicates)
  D --> E(Make gVCF files for each sample/chromo: gatk HaplotypeCaller)
  E --> F(Load gVCFs into Genomic DB for each chromo: gatk GenomicsDBImport)
  F --> G(Create VCFs from Genomic DB for each chromo: gatk GenotypeGVCFs)
  G --> H(Concatenate chromosome-vcfs into a single vcf: bcftools)

A mini data set that only takes about 25 minutes to run through the major steps of a GATK-like variant calling workflow

  • Chinook salmon sequencing reads (a subset of our course example data).
  • Three pairs of paired-end fastq files, from samples A, B, and C, with data from only four chromosomes.
  • We will trim the reads, map them, and mark duplicates, then make one gVCF file for each combination of individual and chromosome.
  • Then, call variants on each chromosome.
  • Then concatenate the resulting VCFs into a single VCF file.

Setting up our workspaces

  • Sync your fork’s main branch then pull updates into the main branch of your local clone.
  • We will just be working on a login node, since we will be submitting jobs via Snakemake to SLURM.
  • Note: to allow Snakemake to keep running on the login node after you disconnect, you need to run it inside tmux or screen (or perhaps launch it with nohup).
  • You should already have a snakemake environment set up, according to the directions here
  • Activate that environment.
  • We do all of today's activities from the top level of the repo.
# activate env
conda activate snakemake-8.5.3

# To make sure snakemake is working, print the help information
# for snakemake
snakemake --help

# make sure that you are in the top level of the
# con-gen-csu repo:
pwd


# In case we don't have conda environments built here for the
# workflow, do this in the start of class and let it run
snakemake  --conda-create-envs-only  -s Snakemake-Example/Snakefile2
  • Note: If you are using a version of Snakemake < 8.0 (like some of the folks at NMFS) things should all work out, but the actual profiles are a little different.

Updated Snakefile and associated files

  • We can use the Unix tree utility to see what the Snakemake-Example directory contains.
  • Within the Snakemake-Example directory, type tree at the command line. The new files here are:
    • Snakefile2. Much more about that later.
    • A new file named config.yaml
    • A new file called sample-info.tsv
    • A directory slurm with some profiles
  • The smk directory has profiles using Snakemake's officially supported SLURM executor.
  • The jdb directory has profiles for running SLURM with John D. Blischak’s simple-slurm-smk approach.
Snakemake-Example/
├── Snakefile
├── Snakefile2
├── config.yaml
├── data
│   ├── A_R1.fastq.gz
│   ├── A_R2.fastq.gz
│   ├── B_R1.fastq.gz
│   ├── B_R2.fastq.gz
│   ├── C_R1.fastq.gz
│   └── C_R2.fastq.gz
├── envs
│   ├── bcftools.yaml
│   ├── bwa2sam.yaml
│   ├── fastp.yaml
│   └── gatk.yaml
├── resources
│   └── genome.fasta
├── sample-info.tsv
└── slurm
    ├── jdb
    │   ├── alpine
    │   │   ├── config.v8+.yaml
    │   │   ├── config.yaml
    │   │   └── status-sacct-robust.sh
    │   └── sedna
    │       ├── config.v8+.yaml
    │       ├── config.yaml
    │       └── status-sacct-robust.sh
    └── smk
        ├── alpine
        │   └── config.v8+.yaml
        └── sedna
            └── config.v8+.yaml

Eric Demonstrates a Run on SLURM

Students don’t do this yet!

We tell Snakemake to submit jobs via SLURM by including a profile:

Demonstrate first running it on SEDNA. Dry-run:

snakemake  -np -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/jdb/sedna
  • That dry run shows all rules as “localrules” (seems to be a limitation of the dry-run)

Do the actual run:

snakemake  -p -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/jdb/sedna
  • Note that this submits jobs via SLURM

  • Check it with myjobs.

  • If I need to cancel the run, a single <ctrl>-c will tell Snakemake to start killing all the jobs using scancel. Be patient and let it do its work.

  • When we start it back up, Snakemake knows which files need to be regenerated, even if they were already partially made.

What is this profile?

  • A profile is a directory that contains a file config.yaml inside it, and possibly other files.
  • The config.yaml can instead be named config.v8+.yaml, which Snakemake versions >= 8.0 use preferentially.
  • This YAML file stores command line options:
    • --option arg becomes
      option: arg
    • Options that take no argument become option: True
  • Snakemake then applies these options as if they had been typed on the command line.
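For intuition, the mapping from profile keys back to command-line flags can be sketched in a few lines of Python (this is just an illustration, not Snakemake's actual code):

```python
# Sketch (not Snakemake's real code): how keys in a profile's
# config.yaml correspond to command-line flags.
def profile_to_flags(profile):
    flags = []
    for key, val in profile.items():
        if val is True:                  # flag-style option, e.g. keep-going: True
            flags.append(f"--{key}")
        else:                            # option with an argument, e.g. jobs: 950
            flags.append(f"--{key} {val}")
    return " ".join(flags)

# {"jobs": 950, "keep-going": True} corresponds to "--jobs 950 --keep-going"
```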
Contents of Snakemake-Example/slurm/jdb/alpine/config.v8+.yaml

executor: cluster-generic
cluster-generic-submit-cmd:
  mkdir -p results/slurm_logs/{rule} &&
  sbatch
    --partition=amilan,csu
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --time={resources.time}
    --job-name=smk-{rule}-{wildcards}
    --output=results/slurm_logs/{rule}/{rule}-%j-{wildcards}.out
    --error=results/slurm_logs/{rule}/{rule}-%j-{wildcards}.err
    --parsable
cluster-generic-status-cmd: status-sacct-robust.sh
cluster-generic-cancel-cmd: scancel
cluster-generic-cancel-nargs: 400
# warning, time here is very small.  It works for the small
# example data set, but should be increased for production jobs
default-resources:
  - time="00:30:00"
  - mem_mb=3740
  - tmpdir="results/snake-tmp"
restart-times: 0
max-jobs-per-second: 10
max-status-checks-per-second: 25
local-cores: 1
latency-wait: 60
cores: 2400
jobs: 950
keep-going: True
rerun-incomplete: True
printshellcmds: True
software-deployment-method: conda
rerun-trigger: mtime

The “SLURM” parts of the profile

  • executor: use the generic cluster snakemake executor plugin.
  • cluster-generic-submit-cmd: use this command to submit jobs to the cluster. (Note the {resources.time}, etc.)
  • cluster-generic-status-cmd: command snakemake can use to check status of jobs
  • cluster-generic-cancel-cmd: command to use to kill running jobs if we terminate Snakemake (i.e., hit <ctrl>-c).
  • default-resources: Unless otherwise noted (in the rule) use these as the compute resources for each job. Note the memory here is customized for Alpine. (The SEDNA profile is customized for SEDNA).
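The {threads} and {resources.mem_mb} style placeholders in the submit command are filled per job with ordinary Python string formatting. A simplified sketch (the real substitution machinery lives inside Snakemake; SimpleNamespace here just stands in for the resources object):

```python
from types import SimpleNamespace

# Simplified sketch of how placeholders in cluster-generic-submit-cmd
# get filled in for each job. SimpleNamespace mimics Snakemake's
# resources object so that {resources.mem_mb} attribute lookups work.
resources = SimpleNamespace(mem_mb=7480, time="01:00:00")
template = "sbatch --cpus-per-task={threads} --mem={resources.mem_mb} --time={resources.time}"
cmd = template.format(threads=2, resources=resources)
# cmd == "sbatch --cpus-per-task=2 --mem=7480 --time=01:00:00"
```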

JDB’s cluster status command

  • The standard Snakemake profile for SLURM uses a python script to check cluster status, and this seems to take much more time to invoke than just running JDB’s shell script.
  • Main purpose: return running, success, or failed for any SLURM job number.
  • Much of the code here is in place to give SLURM many chances to tell us how the job is doing. (In case SLURM is busy-ish…)
  • You most likely won’t ever need to edit this, but you should make sure it is executable if you add it to an updated profile.
Contents of Snakemake-Example/slurm/jdb/alpine/status-sacct-robust.sh
#!/usr/bin/env bash

# Check status of Slurm job. More robust because runs `sacct` multiple times if
# needed

jobid="$1"

if [[ "$jobid" == Submitted ]]
then
  echo smk-simple-slurm: Invalid job ID: "$jobid" >&2
  echo smk-simple-slurm: Did you remember to add the flag --parsable to your sbatch call? >&2
  exit 1
fi

function get_status(){
  sacct -j "$1" --format State --noheader | head -n 1 | awk '{print $1}'
}

for i in {1..20}
do
  output=`get_status "$jobid"`
  if [[ ! -z $output ]]
  then
    break
  else
    sleep 3
  fi
done

if [[ -z $output ]]
then
  echo sacct failed to return the status for jobid "$jobid" >&2
  echo Maybe you need to use scontrol instead? >&2
  exit 1
fi

if [[ $output =~ ^(COMPLETED).* ]]
then
  echo success
elif [[ $output =~ ^(RUNNING|PENDING|COMPLETING|CONFIGURING|SUSPENDED).* ]]
then
  echo running
else
  echo failed
fi
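The final if/elif of the script boils down to a three-way classification of sacct's State field. The same logic in Python terms (just an illustration; the workflow uses the shell script itself):

```python
import re

def classify(state: str) -> str:
    """Map an sacct State string to the word Snakemake expects.
    Mirrors the regexes at the end of status-sacct-robust.sh."""
    if re.match(r"COMPLETED", state):
        return "success"
    if re.match(r"(RUNNING|PENDING|COMPLETING|CONFIGURING|SUSPENDED)", state):
        return "running"
    # anything else (FAILED, CANCELLED, TIMEOUT, OUT_OF_MEMORY, ...) is a failure
    return "failed"
```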

Now, you try it on Alpine (or SEDNA)

  • This will submit a lot of jobs to the cluster (and we will see whether that is a problem, especially if Snakemake is checking job statuses often).

  • Do this on a login node.

  • With Snakemake >= 8.0 you have to install the Snakemake executor plugin for a generic cluster. Don’t do this for Snakemake < 8.0:

    # your snakemake environment must be activated for this to work
    pip install snakemake-executor-plugin-cluster-generic

On Alpine

# do a dry run:
snakemake  -np -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/jdb/alpine

# do a real run:
snakemake  -p -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/jdb/alpine

On SEDNA

# do a dry run:
snakemake  -np -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/jdb/sedna

# do a real run:
snakemake  -p -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/jdb/sedna

When the job is running

  • Check your queued and running jobs with myjobs or myjobs 40 to see 40 characters of the job name.

  • See that the job names actually make sense. (This is not the case with the official Snakemake slurm executor)

  • Check the slurm logs that are written in results/slurm_logs with reasonable names.

Eric shows the official snakemake slurm executor

This only works with Snakemake >= 8.0

I will run this on SEDNA:

# before running this, it is necessary to do this (once) while
# the snakemake environment is active:
pip install snakemake-executor-plugin-slurm

# now, I call it with the profile that uses the official
# snakemake slurm plugin.

# Dry run:
snakemake  -np -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/smk/sedna
 
# Real Run:
snakemake  -p -s Snakemake-Example/Snakefile2 --profile Snakemake-Example/slurm/smk/sedna
  • Check it with myjobs 50. Wow! Those are some Fugly and useless job names:
      JOBID PARTITION                                               NAME       USER ST            TIME  NODES   NODELIST(REASON)  CPUS   MIN_MEMORY   TIME_LIMIT    TIME_LEFT PRIORITY
      867355  standard               f8b21817-fa1d-4a40-905d-2e9028d14c03  eanderson PD            0:00      1         (Priority)     1        4800M      8:00:00      8:00:00 0.99990073288789
      867356  standard               f8b21817-fa1d-4a40-905d-2e9028d14c03  eanderson PD            0:00      1         (Priority)     1        4800M      8:00:00      8:00:00 0.99990073265506
      867357  standard               f8b21817-fa1d-4a40-905d-2e9028d14c03  eanderson PD            0:00      1         (Priority)     1        4800M      8:00:00      8:00:00 0.99990073242222
      867358  standard               f8b21817-fa1d-4a40-905d-2e9028d14c03  eanderson PD            0:00      1         (Priority)     1        4800M      8:00:00      8:00:00 0.99990073218939
      867359  standard               f8b21817-fa1d-4a40-905d-2e9028d14c03  eanderson PD            0:00      1         (Priority)     1        4800M      8:00:00      8:00:00 0.99990073195656

A First High-level View of Snakefile2

Contents of Snakemake-Example/Snakefile2
# start with some stuff to get the Snakemake version
from snakemake import __version__ as snakemake_version
smk_major_version = int(snakemake_version.split(".")[0])



# import modules as needed. Also import the snakemake
# internal command to load a config file into a dict
import pandas as pd

if smk_major_version >= 8:
  from snakemake.common.configfile import _load_configfile
else:
  from snakemake.io import _load_configfile



# define rules that don't need to be run on a compute node.
# i.e. those that can be run locally.
localrules: all, genome_faidx, genome_dict



### Get a dict named config from Snakemake-Example/config.yaml
configfile: "Snakemake-Example/config.yaml"




### Get the sample info table read into a pandas data frame
sample_table=pd.read_table(config["sample_info"], dtype="str").set_index(
    "sample", drop=False
)



### Transfer values from the yaml and tabular config to
### our familiar lists, SAMPLES and CHROMOS
# Populate our SAMPLES list from the sample_table using a little
# pandas syntax
SAMPLES=sample_table["sample"].unique().tolist()

# Define CHROMOS from the values in the config file
CHROMOS=config["chromos"]



### Input Functions that use the tabular sample_info
# define a function to get the fastq path from the sample_table. This
# returns it as a dict, so we need to unpack it in the rule
def get_fastqs(wildcards):
  fq1=sample_table.loc[ wildcards.sample, "fq1" ]
  fq2=sample_table.loc[ wildcards.sample, "fq2" ]
  return {"r1": fq1, "r2": fq2 }

# define a function for getting the read group information
# from the sample table for each particular sample (according
# to the wildcard value)
def get_read_group(wildcards):
    """Denote sample name and platform in read group."""
    return r"-R '@RG\tID:{sample}_{sample_id}_{library}_{flowcell}_{lane}_{barcode}\tSM:{sample_id}\tPL:{platform}\tLB:{library}\tPU:{flowcell}.{lane}.{barcode}'".format(
        sample=wildcards.sample,
        sample_id=sample_table.loc[(wildcards.sample), "sample_id"],
        platform=sample_table.loc[(wildcards.sample), "platform"],
        library=sample_table.loc[(wildcards.sample), "library"],
        flowcell=sample_table.loc[(wildcards.sample), "flowcell"],
        lane=sample_table.loc[(wildcards.sample), "lane"],
        barcode=sample_table.loc[(wildcards.sample), "barcode"],
    )




### Specify rule "all"
# By default, Snakemake tries to create the input files needed
# for the first rule in the Snakefile, so we define the first
# rule to ask for results/vcf/all.vcf.gz
rule all:
  input: "results/vcf/all.vcf.gz"





rule genome_faidx:
  input:
    "data/genome/genome.fasta",
  output:
    "data/genome/genome.fasta.fai",
  conda:
    "envs/bwa2sam.yaml"
  log:
    "results/logs/genome_faidx.log",
  shell:
    "samtools faidx {input} 2> {log} "


rule genome_dict:
  input:
    "data/genome/genome.fasta",
  output:
    "data/genome/genome.dict",
  conda:
    "envs/bwa2sam.yaml"
  log:
    "results/logs/genome_dict.log",
  shell:
    "samtools dict {input} > {output} 2> {log} "


rule bwa_index:
  input:
    "data/genome/genome.fasta"
  output:
    multiext("data/genome/genome.fasta", ".0123", ".amb", ".ann", ".bwt.2bit.64", ".pac"),
  conda:
    "envs/bwa2sam.yaml"
  log:
    out="results/logs/bwa_index/bwa_index.log",
    err="results/logs/bwa_index/bwa_index.err"
  shell:
    "bwa-mem2 index {input} > {log.out} 2> {log.err} "




rule trim_reads:
  input:
    unpack(get_fastqs) # unpack creates named inputs from the dict that
                       # get_fastqs returns
  output:
    r1="results/trimmed/{sample}_R1.fastq.gz",
    r2="results/trimmed/{sample}_R2.fastq.gz",
    html="results/qc/fastp/{sample}.html",
    json="results/qc/fastp/{sample}.json"
  conda:
    "envs/fastp.yaml"
  log:
    out="results/logs/trim_reads/{sample}.log",
    err="results/logs/trim_reads/{sample}.err",
  params:
    as1=config["params"]["fastp"]["adapter_sequence1"],
    as2=config["params"]["fastp"]["adapter_sequence2"],
    parm=config["params"]["fastp"]["other_options"]
  shell:
    " fastp -i {input.r1} -I {input.r2}       "
    "       -o {output.r1} -O {output.r2}     "
    "       -h {output.html} -j {output.json} "
    "  --adapter_sequence={params.as1}        "
    "  --adapter_sequence_r2={params.as2}     "
    "  {params.parm} > {log.out} 2> {log.err}                         "
    


rule map_reads:
  input:
    r1="results/trimmed/{sample}_R1.fastq.gz",
    r2="results/trimmed/{sample}_R2.fastq.gz",
    genome="data/genome/genome.fasta",
    idx=multiext("data/genome/genome.fasta", ".0123", ".amb", ".ann", ".bwt.2bit.64", ".pac")
  output:
    "results/bam/{sample}.bam"
  conda:
    "envs/bwa2sam.yaml"
  log:
    "results/logs/map_reads/{sample}.log"
  threads: 2
  resources:
    mem_mb=7480,
    time="01:00:00"
  params:
    RG=get_read_group
  shell:
    " (bwa-mem2 mem -t {threads} {params.RG} {input.genome} {input.r1} {input.r2} | "
    " samtools view -u | "
    " samtools sort - > {output}) 2> {log} "




rule mark_duplicates:
  input:
    "results/bam/{sample}.bam"
  output:
    bam="results/mkdup/{sample}.bam",
    bai="results/mkdup/{sample}.bai",
    metrics="results/qc/mkdup_metrics/{sample}.metrics"
  conda:
    "envs/gatk.yaml"
  log:
    "results/logs/mark_duplicates/{sample}.log"
  shell:
    " gatk MarkDuplicates  "
    "  --CREATE_INDEX "
    "  -I {input} "
    "  -O {output.bam} "
    "  -M {output.metrics} > {log} 2>&1 "




rule make_gvcfs_by_chromo:
  input:
    bam="results/mkdup/{sample}.bam",
    bai="results/mkdup/{sample}.bai",
    ref="data/genome/genome.fasta",
    idx="data/genome/genome.dict",
    fai="data/genome/genome.fasta.fai"
  output:
    gvcf="results/gvcf/{chromo}/{sample}.g.vcf.gz",
    idx="results/gvcf/{chromo}/{sample}.g.vcf.gz.tbi",
  conda:
    "envs/gatk.yaml"
  log:
    "results/logs/make_gvcfs_by_chromo/{chromo}/{sample}.log"
  params:
    java_opts="-Xmx4g",
    hmt=config["params"]["gatk"]["HaplotypeCaller"]["hmm_threads"]
  shell:
    " gatk --java-options \"{params.java_opts}\" HaplotypeCaller "
    " -R {input.ref} "
    " -I {input.bam} "
    " -O {output.gvcf} "
    " -L {wildcards.chromo}    "           
    " --native-pair-hmm-threads {params.hmt} " 
    " -ERC GVCF > {log} 2> {log} "




rule import_genomics_db_by_chromo:
  input:
    gvcfs=expand("results/gvcf/{{chromo}}/{s}.g.vcf.gz", s=SAMPLES)
  output:
    gdb=directory("results/genomics_db/{chromo}")
  conda:
    "envs/gatk.yaml"
  log:
    "results/logs/import_genomics_db_by_chromo/{chromo}.log"
  params:
    java_opts="-Xmx4g"
  shell:
    " VS=$(for i in {input.gvcfs}; do echo -V $i; done); "  # make a string like -V file1 -V file2
    " gatk --java-options \"-Xmx4g\" GenomicsDBImport "
    "  $VS  "
    "  --genomicsdb-workspace-path {output.gdb} "
    "  -L  {wildcards.chromo} 2> {log} "




rule vcf_from_gdb_by_chromo:
  input:
    gdb="results/genomics_db/{chromo}",
    ref="data/genome/genome.fasta",
    fai="data/genome/genome.fasta.fai",
    idx="data/genome/genome.dict",
  output:
    vcf="results/chromo_vcfs/{chromo}.vcf.gz",
    idx="results/chromo_vcfs/{chromo}.vcf.gz.tbi",
  conda:
    "envs/gatk.yaml"
  log:
    "results/logs/vcf_from_gdb_by_chromo/{chromo}.txt"
  shell:
    " gatk --java-options \"-Xmx4g\" GenotypeGVCFs "
    "  -R {input.ref}  "
    "  -V gendb://{input.gdb} "
    "  -O {output.vcf} 2> {log} "


rule concat_vcfs:
  input:
    vcfs=expand("results/chromo_vcfs/{c}.vcf.gz", c=CHROMOS)
  output:
    vcf="results/vcf/all.vcf.gz"
  conda:
    "envs/bcftools.yaml"
  log:
    "results/concat_vcfs/all.log"
  shell:
    "bcftools concat -n {input.vcfs} > {output.vcf} 2> {log} "

Setting Resources and Threads in Rules

rule map_reads:
  input:
    r1="results/trimmed/{sample}_R1.fastq.gz",
    r2="results/trimmed/{sample}_R2.fastq.gz",
    genome="data/genome/genome.fasta",
    idx=multiext("data/genome/genome.fasta", ".0123", ".amb", ".ann", ".bwt.2bit.64", ".pac")
  output:
    "results/bam/{sample}.bam"
  conda:
    "envs/bwa2sam.yaml"
  log:
    "results/logs/map_reads/{sample}.log"
  threads: 2
  resources:
    mem_mb=7480,
    time="01:00:00"
  params:
    RG=get_read_group
  shell:
    " (bwa-mem2 mem -t {threads} {params.RG} {input.genome} {input.r1} {input.r2} | "
    " samtools view -u | "
    " samtools sort - > {output}) 2> {log} "

YAML based configuration of workflows

Loading a config file within the Snakefile

### Get a dict named config from Snakemake-Example/config.yaml
configfile: "Snakemake-Example/config.yaml"
  • This loads values from a YAML file into a dict variable named config.

  • Note that this config.yaml is completely different from the config.yaml in the SLURM profile we just looked at.

What does this config.yaml have in it?

Contents of Snakemake-Example/config.yaml

# path to the genome (not used in workflow, but we could have...)
genome_path: "data/genome/genome.fasta"

# path to file with information about samples
sample_info: "Snakemake-Example/sample-info.tsv"

# Put the list of chromosomes we want to do here
chromos: 
  - "NC_037122.1f5t9"
  - "NC_037123.1f10t14"
  - "NC_037124.1f8t16"
  - "NC_037125.1f20t24"

# parameters to be used for different rules/programs
params:
  fastp:
    adapter_sequence1: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    adapter_sequence2: "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    other_options:
      - "--detect_adapter_for_pe"
      - "--cut_right" 
      - "--cut_right_window_size 4"
      - "--cut_right_mean_quality 20"
  gatk:
    HaplotypeCaller:
      hmm_threads: 1

    

Config file variables used in Snakefile

Get the chromosome names out of the config:

# Define CHROMOS from the values in the config file
CHROMOS=config["chromos"]

Config file variables used in Snakefile

rule trim_reads:
  input:
    unpack(get_fastqs) # unpack creates named inputs from the dict that
                       # get_fastqs returns
  output:
    r1="results/trimmed/{sample}_R1.fastq.gz",
    r2="results/trimmed/{sample}_R2.fastq.gz",
    html="results/qc/fastp/{sample}.html",
    json="results/qc/fastp/{sample}.json"
  conda:
    "envs/fastp.yaml"
  log:
    out="results/logs/trim_reads/{sample}.log",
    err="results/logs/trim_reads/{sample}.err",
  params:
    as1=config["params"]["fastp"]["adapter_sequence1"],
    as2=config["params"]["fastp"]["adapter_sequence2"],
    parm=config["params"]["fastp"]["other_options"]
  shell:
    " fastp -i {input.r1} -I {input.r2}       "
    "       -o {output.r1} -O {output.r2}     "
    "       -h {output.html} -j {output.json} "
    "  --adapter_sequence={params.as1}        "
    "  --adapter_sequence_r2={params.as2}     "
    "  {params.parm} > {log.out} 2> {log.err}                         "
    

You can grab values from the config dict variable in the input, output, or params, etc. blocks.
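Outside of Snakemake, the config variable is nothing more than a nested Python dict. Hand-building the same structure (abridged, with values copied from config.yaml above) shows exactly how the lookups in the params block work:

```python
# The dict that the `configfile:` directive builds from config.yaml
# (abridged; values copied from the file shown above).
config = {
    "sample_info": "Snakemake-Example/sample-info.tsv",
    "chromos": ["NC_037122.1f5t9", "NC_037123.1f10t14",
                "NC_037124.1f8t16", "NC_037125.1f20t24"],
    "params": {
        "fastp": {
            "adapter_sequence1": "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
            "other_options": ["--detect_adapter_for_pe", "--cut_right"],
        },
        "gatk": {"HaplotypeCaller": {"hmm_threads": 1}},
    },
}

# The same nested lookups used in the Snakefile:
CHROMOS = config["chromos"]
as1 = config["params"]["fastp"]["adapter_sequence1"]
hmt = config["params"]["gatk"]["HaplotypeCaller"]["hmm_threads"]
```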

Tabular Configuration in a Snakefile

In the config:

# path to file with information about samples
sample_info: "Snakemake-Example/sample-info.tsv"
  • Look at Snakemake-Example/sample-info.tsv to see what that tabular file looks like.

  • Note, we are using actual sample IDs from our lab for these instead of the DPCh_plate1_F10_S70 sorts of sample names.

  • But now, how do we find the right fastq files?

Read it in the snakefile:

### Get the sample info table read into a pandas data frame
sample_table=pd.read_table(config["sample_info"], dtype="str").set_index(
    "sample", drop=False
)



### Transfer values from the yaml and tabular config to
### our familiar lists, SAMPLES and CHROMOS
# Populate our SAMPLES list from the sample_table using a little
# pandas syntax
SAMPLES=sample_table["sample"].unique().tolist()
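You can try this pandas pattern on a toy table. In the sketch below the file paths are made up, and only the sample, fq1, and fq2 columns are shown (the real sample-info.tsv has more columns, e.g. those used by get_read_group):

```python
import io
import pandas as pd

# A toy stand-in for sample-info.tsv (made-up paths, abridged columns)
tsv = (
    "sample\tfq1\tfq2\n"
    "A\tdata/A_R1.fastq.gz\tdata/A_R2.fastq.gz\n"
    "B\tdata/B_R1.fastq.gz\tdata/B_R2.fastq.gz\n"
)
sample_table = pd.read_table(io.StringIO(tsv), dtype="str").set_index(
    "sample", drop=False   # index by sample, but keep "sample" as a column too
)

SAMPLES = sample_table["sample"].unique().tolist()   # ["A", "B"]
fq1_of_A = sample_table.loc["A", "fq1"]              # row lookup by sample name
```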

Input Functions
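To see how an input function like get_fastqs behaves, we can call it by hand with a fake wildcards object (a sketch with a made-up table; inside Snakemake, the wildcards object is supplied automatically when the rule's output is requested):

```python
import io
from types import SimpleNamespace
import pandas as pd

# Toy sample table (made-up paths), indexed the same way as in Snakefile2
tsv = "sample\tfq1\tfq2\nA\tdata/A_R1.fastq.gz\tdata/A_R2.fastq.gz\n"
sample_table = pd.read_table(io.StringIO(tsv), dtype="str").set_index(
    "sample", drop=False
)

# The input function from Snakefile2: look up both fastq paths for a sample
def get_fastqs(wildcards):
    fq1 = sample_table.loc[wildcards.sample, "fq1"]
    fq2 = sample_table.loc[wildcards.sample, "fq2"]
    return {"r1": fq1, "r2": fq2}

# Stand-in for the wildcards object Snakemake would pass in
fake_wc = SimpleNamespace(sample="A")
paths = get_fastqs(fake_wc)
# paths == {"r1": "data/A_R1.fastq.gz", "r2": "data/A_R2.fastq.gz"}
```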